Introduction to the NLTK Natural Language Processing Package

In a nutshell, NLTK is a library for morphological analysis and related text processing.

The NLTK (Natural Language Toolkit) package is a Python package for natural language processing and document analysis that was originally developed for education. It comes with a wide range of features and examples, and it is widely used in industry and research as well.

The main features provided by the NLTK package are as follows.

  • Sample corpora and dictionaries
  • Tokenizing
  • Morphological analysis (stemming/lemmatizing)
  • Part-of-speech (POS) tagging
  • Syntax parsing

Sample corpora

  • A corpus (Korean: 말뭉치) is a body of sample text; it often serves as training data.
  • Many corpora are available; only a subset is listed below.
  • Annotated corpora include tagging information, which has to be produced by human annotators.
  • Since the collection is very large, each corpus is installed individually as needed.

A corpus is a set of sample documents used for analysis work. Some are simply collections of documents such as novels or newspaper articles, but most add auxiliary annotations such as part-of-speech and morpheme information and organize the text in a structured form to make analysis easier.

The corpus subpackage of NLTK provides a variety of research corpora such as the following. This list is only part of the full collection.

  • averaged_perceptron_tagger: Averaged Perceptron Tagger
  • book_grammars: Grammars from NLTK Book
  • brown: Brown Corpus
  • chat80: Chat-80 Data Files
  • city_database: City Database
  • comparative_sentences: Comparative Sentence Dataset
  • dependency_treebank: Dependency Parsed Treebank
  • gutenberg: Project Gutenberg Selections
  • hmm_treebank_pos_tagger: Treebank Part of Speech Tagger (HMM)
  • inaugural: C-Span Inaugural Address Corpus
  • large_grammars: Large context-free and feature-based grammars for parser comparison
  • mac_morpho: MAC-MORPHO: Brazilian Portuguese news text with part-of-speech tags
  • masc_tagged: MASC Tagged Corpus
  • maxent_ne_chunker: ACE Named Entity Chunker (Maximum entropy)
  • maxent_treebank_pos_tagger: Treebank Part of Speech Tagger (Maximum entropy)
  • movie_reviews: Sentiment Polarity Dataset Version 2.0
  • names: Names Corpus, Version 1.3 (1994-03-29)
  • nps_chat: NPS Chat
  • omw: Open Multilingual Wordnet
  • opinion_lexicon: Opinion Lexicon
  • pros_cons: Pros and Cons
  • ptb: Penn Treebank
  • punkt: Punkt Tokenizer Models
  • reuters: The Reuters-21578 benchmark corpus, ApteMod version
  • sample_grammars: Sample Grammars
  • sentence_polarity: Sentence Polarity Dataset v1.0
  • sentiwordnet: SentiWordNet
  • snowball_data: Snowball Data
  • stopwords: Stopwords Corpus
  • subjectivity: Subjectivity Dataset v1.0
  • tagsets: Help on Tagsets
  • treebank: Penn Treebank Sample
  • twitter_samples: Twitter Samples
  • unicode_samples: Unicode Samples
  • universal_tagset: Mappings to the Universal Part-of-Speech Tagset
  • universal_treebanks_v20: Universal Treebanks Version 2.0
  • verbnet: VerbNet Lexicon, Version 2.1
  • webtext: Web Text Corpus
  • word2vec_sample: Word2Vec Sample
  • wordnet: WordNet
  • words: Word Lists

These corpus resources are not included at installation time; the user must fetch them with the download command.


In [1]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('reuters')
nltk.download('stopwords')
nltk.download('taggers')   # not a valid package id; this call fails (see the log below)
nltk.download('webtext')
nltk.download('wordnet')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Error loading taggers: Package 'taggers' not found in
[nltk_data]     index
[nltk_data] Downloading package webtext to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[1]:
True
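
The failed 'taggers' entry above shows that the argument must be a package id from the download index; the individual tagger packages such as averaged_perceptron_tagger are the valid ids. A small sketch for downloading a resource only when it is missing, using nltk.data.find(), which raises LookupError for absent resources:

import nltk

try:
    nltk.data.find("tokenizers/punkt")   # resource path inside the nltk_data directory
except LookupError:
    nltk.download("punkt")               # fetch only when not already installed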

In [3]:
nltk.corpus.gutenberg.fileids()


Out[3]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']
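
Besides raw(), the Gutenberg corpus reader also exposes pre-tokenized views of each file; a quick sketch:

emma_words = nltk.corpus.gutenberg.words("austen-emma.txt")   # flat list of tokens
emma_sents = nltk.corpus.gutenberg.sents("austen-emma.txt")   # list of token lists, one per sentence
print(len(emma_words), len(emma_sents))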

In [4]:
emma_raw = nltk.corpus.gutenberg.raw("austen-emma.txt")
print(emma_raw[:1302])


[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness of her temper had hardly allowed
her to impose any restraint; and the shadow of authority being
now long passed away, they had been living together as friend and
friend very mutually attached, and Emma doing just what she liked;
highly esteeming Miss Taylor's judgment, but directed chiefly by
her own.

Token generation (tokenizing)

Tokenizing means splitting text up: the whole text is broken into pieces. How to split it differs case by case; you split it the way your analysis requires.

To analyze a document, the long input string must first be divided into smaller units of analysis. These string units are called tokens.


In [5]:
from nltk.tokenize import word_tokenize
word_tokenize(emma_raw[50:100])   # punctuation such as ',' becomes a separate token


Out[5]:
['Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a']
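
In NLTK 3, word_tokenize is essentially a convenience wrapper: it first splits sentences with the Punkt model and then applies the Treebank word tokenizer to each sentence. A sketch of calling the underlying tokenizer directly:

from nltk.tokenize import TreebankWordTokenizer
TreebankWordTokenizer().tokenize(emma_raw[50:100])   # same word-level splitting rules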

In [7]:
from nltk.tokenize import RegexpTokenizer
t = RegexpTokenizer(r"[\w]+")   # one or more word characters; ',' is dropped
t.tokenize(emma_raw[50:100])


Out[7]:
['Emma', 'Woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a']
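
RegexpTokenizer can also be used the other way around: with gaps=True the pattern describes the separators rather than the tokens. A minimal sketch:

t2 = RegexpTokenizer(r"\s+", gaps=True)   # the pattern matches the gaps between tokens
t2.tokenize(emma_raw[50:100])             # punctuation stays attached to the words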

In [8]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(emma_raw[:1000])[3])


Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.

Morphological analysis

Morphological analysis is the task of identifying the structure of various linguistic properties such as roots, prefixes/suffixes, and parts of speech (POS). Concretely, it divides into the following tasks.

  • stemming (extracting the stem)
  • lemmatizing (recovering the dictionary base form)
  • POS tagging (part-of-speech tagging)

Stemming and lemmatizing


In [9]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
st.stem("eating")


Out[9]:
'eat'

In [10]:
from nltk.stem import LancasterStemmer
st = LancasterStemmer()
st.stem("shopping")


Out[10]:
'shop'

In [11]:
from nltk.stem import RegexpStemmer
st = RegexpStemmer("ing")
st.stem("cooking")


Out[11]:
'cook'
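
RegexpStemmer simply deletes every match of the pattern, so it also fires inside words that merely contain the substring. A cautionary sketch with the same stemmer:

st.stem("ring")   # 'r' -- the pattern "ing" matches inside the word as well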

In [12]:
from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()
print(lm.lemmatize("cooking"))
print(lm.lemmatize("cooking", pos="v"))
print(lm.lemmatize("cookbooks"))


cooking
cook
cookbook

In [13]:
print(WordNetLemmatizer().lemmatize("believes"))
print(LancasterStemmer().stem("believes"))


belief
believ
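
The contrast generalizes: stemmers chop by rule, while the lemmatizer returns dictionary forms. A quick side-by-side sketch (the word list is just an illustration):

from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

for word in ["flies", "studies", "meeting"]:
    print(word,
          PorterStemmer().stem(word),          # rule-based suffix stripping
          LancasterStemmer().stem(word),       # more aggressive rules
          WordNetLemmatizer().lemmatize(word)) # dictionary lookup (noun by default)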

POS tagging


In [14]:
from nltk.tag import pos_tag
tagged_list = pos_tag(word_tokenize(emma_raw[:100]))
tagged_list


Out[14]:
[('[', 'NNS'),
 ('Emma', 'NNP'),
 ('by', 'IN'),
 ('Jane', 'NNP'),
 ('Austen', 'NNP'),
 ('1816', 'CD'),
 (']', 'NNP'),
 ('VOLUME', 'NNP'),
 ('I', 'PRP'),
 ('CHAPTER', 'VBP'),
 ('I', 'PRP'),
 ('Emma', 'NNP'),
 ('Woodhouse', 'NNP'),
 (',', ','),
 ('handsome', 'NN'),
 (',', ','),
 ('clever', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('rich', 'JJ'),
 (',', ','),
 ('with', 'IN'),
 ('a', 'DT')]
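
The Penn Treebank tags produced by pos_tag can be fed back into the lemmatizer, which otherwise assumes every word is a noun. A sketch, where the mapping helper penn2wordnet is our own name, not an NLTK function:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn2wordnet(tag):
    # map Penn Treebank tag prefixes to the WordNet POS constants
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN   # lemmatize() defaults to nouns anyway

lm = WordNetLemmatizer()
[(word, lm.lemmatize(word, penn2wordnet(tag))) for word, tag in tagged_list]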

In [15]:
from nltk.tag import untag
untag(tagged_list)


Out[15]:
['[',
 'Emma',
 'by',
 'Jane',
 'Austen',
 '1816',
 ']',
 'VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a']
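
To look up what a tag such as 'NNP' or 'VBP' means, NLTK ships tagset documentation (the tagsets entry in the corpus list above). A short sketch:

import nltk
nltk.download("tagsets")        # the help data itself is a downloadable package
nltk.help.upenn_tagset("NNP")   # prints the tag's definition and example words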